An Analysis of Tobacco in Films and Youth Tobacco Usage

Cole Dorsey

Introduction

According to the CDC, cigarette smoking causes about one of every five deaths in the United States each year. Cigarette smoking is estimated to 1. cause more than 480,000 deaths annually (including deaths from secondhand smoke) 2. cause 278,544 deaths annually among men (including deaths from secondhand smoke) and 3. cause 201,773 deaths annually among women (including deaths from secondhand smoke). The aforementioned statistics resoundingly support the conclusion that smoking kills. Despite this conclusion being accepted widely for nearly 6 decades, the film making industry perpetually glorifies the consumption of tobacco in all formats. Ultimately, this begs the question, is there a significant correlation between tobacco incidents in films and youth tobacco usage? Are there any patterns between production companies and propensity to include tobacco usage in their films? How about directors and their propensity to include tobacco usage in their films? How has the smoking medium within film changed(cigarette, pipe, e-cigarette, smokeless)? If there is a change in smoking medium throughout film over time, is this reflected in youth smoking medium similarly changing?

Hypotheses

We set out to attempt to answer the aforementioned questions. Before embarking on our proverbial “hunt for the grail” we as a group had certain hypotheses that we used as guidance while navigating the metaphorical data landscape bound by 1’s and 0’s in this ether of cyberspace.

Namely, we hypothesized the following:

  1. The presence of smoking cigarettes in cinema has decreased over time, while the presence of alternative smoking means (e-cigarettes, etc) has increased. 

  2. The overall number of tobacco incidents in cinema has decreased over time.

  3. There is a correlation between tobacco incidents within films and youth tobacco usage. (Proving causality would be unrealistic given the constraints of working with data and not a controlled study)

  4. Production companies like Disney produce the least amount of tobacco incidents over time. The propensity for a movie to include tobacco usage on screen depends on the production company - certain production companies are more likely to produce movies with onscreen tobacco usage.

  5. Quentin Tarantino films feature the most amount of tobacco incidents and this correlates positively with youth smoking rates. The propensity for a movie to include tobacco usage depends heavily on the director - certain directors are more likely to produce movies with onscreen tobacco usage.

#IMPORTING LIBRARIES
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
##       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
##       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(gtable)
library(wordcloud)
## Loading required package: RColorBrewer
#LOADING IN DATA
movie_data <- read.csv("/Users/coledorsey/Downloads/movie_video_production-2022-11-22PST18-14-30.csv")

#CLEANING DATA

#ELIMINATING OLD ID COLUMNS
clean_movie_data <- select(movie_data,  -3, -4,-41,-40,)

#SPACING OUT RELEASE DATE INTO SEPARATE MONTH, DAY, AND YEAR COLUMNS
clean_movie_data2 <- separate(clean_movie_data, col = release_date, into=c('release_year', 'release_month', 'release_day'), sep='-')



#DELETING MORE NONESSENTIAL COLUMNS
final_clean <- select(clean_movie_data2, -c("format","descriptor_text", "credited_production_companies", "mpa_descriptor"))


#REPLACING NA VALUES WITH ZEROES.
final_clean[is.na(final_clean)] = 0
#LOADING IN SMOKING HABITS DATA
smoking_habits <- read.csv("/Users/coledorsey/Downloads/Youth_Tobacco_Survey__YTS__Data.csv")

#DATA CLEANING

#REARRANGING BY YEAR
smoking_habits_arranged  <- smoking_habits[order(smoking_habits$YEAR),]


#DELETING UNESSENTIAL COLUMNS
final_smoking <- subset(smoking_habits_arranged, select=-c(DataSource,Data_Value_Footnote_Symbol,StratificationID1,StratificationID2,StratificationID3, StratificationID4,SubMeasureID,DisplayOrder,TopicTypeId))


#FILTERING BY CIGARETTE USE YOUTH
cigarette_use <- filter(final_smoking, TopicDesc == "Cigarette Use (Youth)")

#FILTERING TO BE IRRESPECTIVE OF GENDER
cigarette_use_genderless <- filter(cigarette_use, Gender == "Overall")


#FILTERINGS BY CURRENT SMOKERS
cigarette_use_genderless_current <- filter(cigarette_use_genderless, Response == "Current")

#FILTERINGS BY FREQUENT SMOKERS
cigarette_use_genderless_frequent <- filter(cigarette_use_genderless, Response == "Frequent")

#FILTERING BY SMOKELESS TOBACCO USE YOUTH
smokeless_tobacco_use <- filter(final_smoking, TopicDesc == "Smokeless Tobacco Use (Youth)")

#GENDERLESS SMOKELESS TOBACCO USE 
smokeless_tobacco_use_genderless <- filter(smokeless_tobacco_use, Gender == "Overall")

#FILTERINGS BY CURRENT SMOKERS
smokeless_tobacco_use_genderless_current <- filter(smokeless_tobacco_use_genderless, Response == "Current")

#FILTERING BY FREQUENT SMOKERS
smokeless_tobacco_use_genderless_frequent <- filter(smokeless_tobacco_use_genderless, Response == "Frequent")

#FINAL SMOKELESS CURRENT - GROUPING BY YEAR AND CALCULATING TOTAL CURRENT SMOKELESS PER YEAR
smokeless_tobacco_use_genderless_current1 <- smokeless_tobacco_use_genderless_current %>% group_by(YEAR) %>% summarise(num = n(), current_smokeless_per_year = mean(Data_Value)) 

#FINAL SMOKELESS FREQUENT - GROUPING BY YEAR AND ADDING TOTAL FREQUENT SMOKELESS PER YEAR
smokeless_tobacco_use_genderless_frequent1 <- smokeless_tobacco_use_genderless_frequent %>% group_by(YEAR) %>% summarise(num = n(), frequent_smokeless_per_year = mean(Data_Value)) 


#CALCULATING FINAL CIGARETTE CURRENT SMOKERS PER YEAR
cigarette_use_genderless_current1 <- cigarette_use_genderless_current %>% group_by(YEAR) %>% summarise(num = n(), current_smokers_per_year = mean(Data_Value)) 

#CALCULATING FINAL CIGARETTE FREQUENT SMOKERS PER YEAR
cigarette_use_genderless_frequent1 <- cigarette_use_genderless_frequent %>% group_by(YEAR) %>% summarise(num = n(), frequent_smokers_per_year = mean(Data_Value)) 
final_clean_new <- final_clean %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total))

final_clean_new2 <- final_clean %>% group_by(release_year) %>% summarise(num = n(), impressions_per_year = sum(impressions))

par(mfrow = c(1,2))


barplot(final_clean_new$incidents_per_year, names.arg = final_clean_new$release_year, xlab = "Year", ylab = "Tobacco Incidents", main = "Tobacco Incidents in Films", col = "#69b3a2")

barplot(final_clean_new2$impressions_per_year, names.arg = final_clean_new2$release_year, xlab = "Year", ylab = "Tobacco Impressions", main = "Tobacco Impressions in Films", col = "#69b3a2")

Surprisingly, both “Tobacco incidents” - the occurrence of smoking or other tobacco use in a movie - and “Tobacco Impressions” - one person seeing one incident - still remains high through 2019.

From the year 2003 to 2021, the number of smoking incidents within films per year increased 12.82% from 2561 to 2912.

Despite 2010 having the all time low of # tobacco Impressions within cinema, that number tragically increased 47% from 14,723,189,482 in 2010 to 23,904,152,750 in 2019.

#Sorting Tobacco Incidents in Top-Grossing Movies by Motion-Picture-Association(MPA) Rating. 

#CREATING NEW PG AND G FILMS FILTERED DATA SET
PG_G_films <- filter(final_clean, rating == "PG" | rating == "G")


#CALCULATING PG-G FILM TOBACCO INCIDENTS PER YEAR 
pg_g_new_incidents <- PG_G_films %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#DELETING FIRST EMPTY ROW OF YEAR 2000 DATA
pg_g_new_incidents<- pg_g_new_incidents %>%  filter(!row_number() %in% c(1))

#CALCULATING PG-G FILM TOBACCO IMPRESSIONS PER YEAR
pg_g_new_impressions <- PG_G_films %>% group_by(release_year) %>% summarise(num = n(), impressions_per_year = sum(impressions)) 

#DELETING FIRST EMPTY ROW OF YEAR 2000 DATA
pg_g_new_impressions<- pg_g_new_impressions %>%  filter(!row_number() %in% c(1))

#FILTERING BY PG-13 FILMS 
PG_13_films <- filter(final_clean, rating == "PG-13")

#CALCULATING PG-13 FILM TOBACCO INCIDENTS BY YEAR
pg_13_new_incidents <- PG_13_films %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#CALCULATING PG-13 FILM TOBACCO IMPRESSIONS BY YEAR
pg_13_new_impressions <- PG_13_films %>% group_by(release_year) %>% summarise(num = n(), impressions_per_year = sum(impressions)) 



#FILTERING BY R RATED FILMS
R_films <- filter(final_clean, rating == "R")

#FILTERING R FILMS BY RELEASE YEAR AND CALCULATING TOBACCO INCIDENTS PER YEAR
r_new_incidents <- R_films %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#FILTERING R FILMS BY RELEASE YEAR AND CALCULATING TOBACCO IMPRESSIONS PER YEAR
r_new_impressions <- R_films %>% group_by(release_year) %>% summarise(num = n(), impressions_per_year = sum(impressions)) 

par(mfrow = c(1,2),cex.main =.5)

#CREATING PLOT OF RATING AND TOBACCO INCIDENTS
simple <- plot(r_new_incidents$release_year,r_new_incidents$incidents_per_year, col ='blue', xlab = "Year", ylab = "Tobacco Incidents", main = "Tobacco Incidents By \n Motion Picture of America Association Rating")
lines(r_new_incidents$release_year, r_new_incidents$incidents_per_year, col='blue',lwd=2)
lines(pg_13_new_incidents$release_year, pg_13_new_incidents$incidents_per_year, col='green',lwd=2)
lines(pg_g_new_incidents$release_year, pg_g_new_incidents$incidents_per_year, col='red',lwd=2)
legend(x = "topright", title="Motion Picture of \nAmerica Association Rating", legend = c("R Films", "PG-13 Films","PG & G Films"), fill = c("blue", "green", "red"), cex =.25)

#CREATING PLOT OF TOBACCO IMPRESSIONS
simple2 <-  plot(r_new_impressions$release_year,r_new_impressions$impressions_per_year, col ='blue', xlab = "Year", ylab = "Tobacco Impressions", main = "Tobacco Impressions By \nMotion Picture of America Association Rating")
lines(r_new_impressions$release_year, r_new_impressions$impressions_per_year, col='blue',lwd=2)
lines(pg_13_new_impressions$release_year, pg_13_new_impressions$impressions_per_year, col='green',lwd=2)
lines(pg_g_new_impressions$release_year, pg_g_new_impressions$impressions_per_year, col='red',lwd=2)
legend(x = "topleft", title="Motion Picture of America Association Rating", legend = c("R Films", "PG-13 Films","PG & G Films"), fill = c("blue", "green", "red"), cex =.3)

#CASTING YEAR AS INTEGER FOR OUR NEW DATAFRAME
r_new_incidents$release_year <- as.integer(r_new_incidents$release_year)
r_new_impressions$release_year <- as.integer(r_new_impressions$release_year)

#CREATING NEW DATAFRAME OF RATINGS AND TOBACCO INCIDENTS
rating_incidents_dataframe <- data.frame(Year = r_new_incidents$release_year,r_incidents = r_new_incidents$incidents_per_year, pg_g_incidents = pg_g_new_incidents$incidents_per_year, pg_13_incidents = pg_13_new_incidents$incidents_per_year)

#CREATING NEW DATAFRAME OF RATINGS AND TOBACCO IMPRESSIONS
rating_impressions_dataframe <- data.frame(Year = r_new_impressions$release_year,r_impressions = r_new_impressions$impressions_per_year, pg_g_impressions = pg_g_new_impressions$impressions_per_year, pg_13_impressions = pg_13_new_impressions$impressions_per_year)

#CREATING FIRST GRAPH  OF TOBACCO INCIDENTS BY MOVIE RATING
tobak1 <- ggplot(rating_incidents_dataframe) +
  geom_area(aes(x= Year,y= pg_g_incidents),fill=2,alpha=1.5) +
  geom_area(aes(x=Year,y=pg_13_incidents),fill=3,alpha=0.5) +
  geom_area(aes(x=Year,y=r_incidents),fill=4,alpha=.2) +
  labs(title = "Tobacco Incidents In Movies by MPAA Rating",x = "Year",
       y = "Tobacco Incidents",  subtitle = "Blue = R, Green = PG-13, Red = PG/G") 

#CREATING SECOND GRAPH OF TOBACCO IMPRESSIONS BY MOVIE RATING
 tobak <- ggplot(rating_impressions_dataframe) +
  geom_area(aes(x= Year,y= pg_g_impressions),fill=2,alpha=1.5) +
  geom_area(aes(x=Year,y=pg_13_impressions),fill=3,alpha=0.5) +
  geom_area(aes(x=Year,y=r_impressions),fill=4,alpha=.2) +
  labs(title = "Tobacco Impressions In Movies by MPAA Rating",x = "Year",
       y = "Tobacco Impressions", subtitle = "Blue = R, Green = PG-13, Red = PG/G") 

 
 #PLACING GRAPHS SIDE BY SIDE
 grid.arrange(tobak1, tobak)

The second graph illustrates a point of concern. Despite R rated movies dominating onscreen tobacco incidents. PG-13 movies have been a close contender with R movies when it comes to tobacco impressions. Considering younger individuals are able to see PG-13 movies, this begs the question whether PG-13 movies may be doing more damage than R rated movies because PG-13 movies and their tobacco incidents are more easily accessible and watchable by younger individuals - potentially impressioning younger individuals to smoke.

library(ggplot2)
#Defining a new dataframe based just on different smoking mediums
smoking_mediums_data<- data.frame(year=final_clean$release_year,cigarette=final_clean$cigarettes,cigar=final_clean$cigars,pipes=final_clean$pipes,ecigs=final_clean$e.cigs,hookah=final_clean$hookah, smokeless=final_clean$smokeless, incident_total=final_clean$incidents_total)



#CALCULATING TOTAL TOBACCO INCIDENTS PER YEAR 
incident_total_new <- smoking_mediums_data %>% group_by(year) %>% summarise(num = n(), incidents_per_year = sum(incident_total)) 

#CALCULATING TOTAL CIGARETTE INCIDENTS PER YEAR
cigarette_total_new <- smoking_mediums_data %>% group_by(year) %>% summarise(num = n(), cigarette_per_year = sum(cigarette))

#CALCULATING TOTAL SMOKELESS INCIDENTS PER YEAR
smokeless_total_new <- smoking_mediums_data %>% group_by(year) %>% summarise(num = n(), smokeless_per_year = sum(smokeless))

#CALCULATING TOTAL CIGAR INCIDENTS PER YEAR
cigar_total_new <- smoking_mediums_data %>% group_by(year) %>% summarise(num = n(), cigar_per_year = sum(cigar))

#CALCULATING TOTAL PIPE INCIDENTS PER YEAR
pipe_total_new <- smoking_mediums_data %>% group_by(year) %>% summarise(num = n(), pipes_per_year = sum(pipes))

#CALCULATING TOTAL ECIGS INCIDENTS PER YEAR
ecigs_total_new <- smoking_mediums_data %>% group_by(year) %>% summarise(num = n(), ecigs_per_year = sum(ecigs)) 

#CALCULATING TOTAL HOOKAH INCIDENTS PER YEAR
hookah_total_new <- smoking_mediums_data %>% group_by(year) %>% summarise(num = n(), hookah_per_year = sum(hookah)) 

#CREATING NEW TYPE OF TOBACCO MATRIX
type_of_tobacco_matrix <- data.frame(year=ecigs_total_new$year,cigarette=cigarette_total_new$cigarette_per_year,cigar=cigar_total_new$cigar_per_year,pipes=pipe_total_new$pipes_per_year,ecigs=ecigs_total_new$ecigs_per_year, smokeless=smokeless_total_new$smokeless_per_year, hookah=hookah_total_new$hookah_per_year,incident_total=incident_total_new$incidents_per_year)

#CREATING NEW GROUPING BY YEAR TO ALLOW FOR STACKED BAR CHART
Year <- c(rep("2000",6), rep("2002",6),rep("2003",6),rep("2004",6),rep("2005",6),rep("2006",6),rep("2007",6),rep("2008",6),rep("2009",6),rep("2010",6),rep("2011",6),rep("2012",6),rep("2013",6),rep("2014",6),rep("2015",6),rep("2016",6),rep("2017",6),rep("2018",6),rep("2019",6),rep("2020",6),rep("2021",6),rep("2022",6))

Tobacco_Medium <- rep(c("cigarette" , "cigar" , "pipe", "ecigs", "smokeless", "hookah") , 22)

#CREATING EMPTY LIST 
Tobacco_Incidents<- c()

#ITERATING OVER TYPE OF TOBACCO MATRIX AND ESSENTIALLY FLATTENING THE MATRIX AND ADDING TO NEW VECTOR CALLED TOBACCO INCIDENTS
for (row in 1:nrow(type_of_tobacco_matrix)){
  for (item in 2:(length(type_of_tobacco_matrix[row,])-1)) {
    current_item <- type_of_tobacco_matrix[row,][item]
    Tobacco_Incidents <- append(Tobacco_Incidents,current_item)
  }
}
#UNLISTING TO MAKE SURE EACH NEW VECTOR IS FLATTENED
Year <- unlist(Year)
Tobacco_Medium <- unlist(Tobacco_Medium)
Tobacco_Incidents <- unlist(Tobacco_Incidents)

#ENSURING THE LENGTHS ARE THE SAME
lengths(Year)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
lengths(Tobacco_Medium)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
lengths(Tobacco_Incidents)
## cigarette     cigar     pipes     ecigs smokeless    hookah cigarette     cigar 
##         1         1         1         1         1         1         1         1 
##     pipes     ecigs smokeless    hookah cigarette     cigar     pipes     ecigs 
##         1         1         1         1         1         1         1         1 
## smokeless    hookah cigarette     cigar     pipes     ecigs smokeless    hookah 
##         1         1         1         1         1         1         1         1 
## cigarette     cigar     pipes     ecigs smokeless    hookah cigarette     cigar 
##         1         1         1         1         1         1         1         1 
##     pipes     ecigs smokeless    hookah cigarette     cigar     pipes     ecigs 
##         1         1         1         1         1         1         1         1 
## smokeless    hookah cigarette     cigar     pipes     ecigs smokeless    hookah 
##         1         1         1         1         1         1         1         1 
## cigarette     cigar     pipes     ecigs smokeless    hookah cigarette     cigar 
##         1         1         1         1         1         1         1         1 
##     pipes     ecigs smokeless    hookah cigarette     cigar     pipes     ecigs 
##         1         1         1         1         1         1         1         1 
## smokeless    hookah cigarette     cigar     pipes     ecigs smokeless    hookah 
##         1         1         1         1         1         1         1         1 
## cigarette     cigar     pipes     ecigs smokeless    hookah cigarette     cigar 
##         1         1         1         1         1         1         1         1 
##     pipes     ecigs smokeless    hookah cigarette     cigar     pipes     ecigs 
##         1         1         1         1         1         1         1         1 
## smokeless    hookah cigarette     cigar     pipes     ecigs smokeless    hookah 
##         1         1         1         1         1         1         1         1 
## cigarette     cigar     pipes     ecigs smokeless    hookah cigarette     cigar 
##         1         1         1         1         1         1         1         1 
##     pipes     ecigs smokeless    hookah cigarette     cigar     pipes     ecigs 
##         1         1         1         1         1         1         1         1 
## smokeless    hookah cigarette     cigar     pipes     ecigs smokeless    hookah 
##         1         1         1         1         1         1         1         1 
## cigarette     cigar     pipes     ecigs smokeless    hookah cigarette     cigar 
##         1         1         1         1         1         1         1         1 
##     pipes     ecigs smokeless    hookah 
##         1         1         1         1
#CREATING NEW DATA FRAME
data <- data.frame(Year,Tobacco_Medium,Tobacco_Incidents)

#PLOTTING STACKED BAR CHART BASED ON OUR NEW GROUPINGS
ggplot(data, aes(fill=Tobacco_Medium, y=Tobacco_Incidents, x=Year, main)) +
  geom_bar(position ="stack", stat="identity") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + ggtitle("Tobacco Medium Over Time") 

Our initial hypothesis that, “the presence of smoking cigarettes in cinema has decreased over time, while the presence of alternative smoking means (e-cigarettes, etc) has increased” is disproved with this graph. Cigarette incidents in films remained relatively high from 2000 to 2019; however, in 2020 cigarette tobacco incidents decreased 145% from 3911 incidents to 621 cigarette incidents. This decrease is something to applaud; however, in 2021 cigarette tobacco incidents increased 120% from 2020, up to 2499 cigarette incidents. Surprisingly, e-cigarette incidents usage did not increase in films throughout the 2000’s. 2019 saw the greatest increase of e-cigarette usage in films. Interestingly enough this drastic increase mostly comes from the hit series Euphoria - initially released in 2019.

#PARENT PRODUCTION COMPANY ANALYSIS

#SORTING BY DISNEY
disney_data <- filter(final_clean, parent_company == "Disney")

#CALCULATING DISNEY TOBACCO INCIDENTS
disney_incidents <- sum(disney_data$incidents_total)

#SORTING BY WARNER BROS
warner_bros_data <- filter(final_clean, parent_company == "Warner Bros. Discovery, Inc.")

#CALCULATING WARNER BROS TOBACCO INCIDENTS
warnerbros_incidents <- sum(warner_bros_data$incidents_total)

#SORTING BY SONY 
sony_data <- filter(final_clean, parent_company == "Sony")

#CALCULATING SONY TOBACCO INCIDENTS
sony_incidents <- sum(sony_data$incidents_total)

#SORTING BY COMCAST
comcast_data <- filter(final_clean, parent_company == "Comcast")

#CALCULATING COMCAST TOBACCO INCIDENTS
comcast_incidents <- sum(comcast_data$incidents_total)

#SORTING BY VIACOM 
viacom_cbs_data <- filter(final_clean, parent_company == "ViacomCBS")

#CALCULATING VIACOM TOBACCO INCIDENTS
viacomcbs_incidents <- sum(viacom_cbs_data$incidents_total)

#SORTING BY INDEPENDENT PRODCUTION COMPANY(LIONSGATE, IMAX, MGM, DREAMWORKS)
independents_data <- filter(final_clean, parent_company == "Independents")

#CALCULATING INDEPENDENTS TOBACCO INCIDENTS
independents_incidents <- sum(independents_data$incidents_total)

#SPECEFIC INDEPENDENT PRODUCTION COMPANY
lionsgate_indie_data <- filter(final_clean, indie_name == "Lionsgate")

#CALCULATING LIONSGATE TOBACCO INCIDENTS
lionsgate_incident <- sum(lionsgate_indie_data$incidents_total)

#SORTING BY IMAX
imax_indie_data <- filter(final_clean, indie_name == "IMAX")

#IMAX TOBACCO INCIDENTS
imax_incidents <- sum(imax_indie_data$incidents_total)

#MGM SORTING
mgm_indie_data <- filter(final_clean, indie_name == "MGM")

#MGM TOBACCO INCIDENTS
mgm_incidents <- sum(mgm_indie_data$incidents_total)

#DREAMWORKS SORTING
dreamworks_indie_data <- filter(final_clean, indie_name == "DreamWorks")

#DREAMWORKS INCIDENTS
dreamworks_incidents <- sum(dreamworks_indie_data$incidents_total)

#MIRAMAX SORTING
miramax_indie_data <- filter(final_clean, indie_name == "Miramax")

#MIRAMAX TOBACCO INCIDENTS
miramax_incidents <- sum(miramax_indie_data$incidents_total)





#CREATING AREA CHART BASED ON PRODUCTION COMPANY OVER TIME.
sony_new <- sony_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

disney_ne <- disney_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#ERASING FIRST ROW OF YEAR 2000
disney_new <- disney_ne %>%  filter(!row_number() %in% c(1))

#CREATING WARNER BROS INCIDENT PER YEAR
warner_new <- warner_bros_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#CREATING COMCAST INCIDENTS PER YEAR
comcast_new <- comcast_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#CREATING VIACOM INCIDENTS PER YEAR
viacom_new <- viacom_cbs_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#CREATING LIONSGATE INCIDENTS PER YEAR
lionsgate_new <- lionsgate_indie_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#CREATING MGM INCIDENTS PER YEAR
mgm_new <- mgm_indie_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#CREATING DREAMWORKS INCIDENTS PER YEAR
dreamworks_new <- dreamworks_indie_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#CREATING MIRAMAX INCIDENTS PER YEAR
miramax_new <- miramax_indie_data %>% group_by(release_year) %>% summarise(num = n(), incidents_per_year = sum(incidents_total)) 

#INDEPENDENT COMPANIES MISSING DATA POINTS - ADDING THEM IN TO MAKE DATAFRAME UNIFORM
df2 = data.frame(release_year=c(2011,2012,2013,2014,2015,2017,2018,2019,2020,2021,2022),num = c(0,0,0,0,0,0,0,0,0,0,0), incidents_per_year = c(0,0,0,0,0,0,0,0,0,0,0))
miramax_new <- rbind(miramax_new, df2)

#INDEPENDENT COMPANIES MISSING DATA POINTS - ADDING THEM IN TO MAKE DATAFRAME UNIFORM
dh2 = data.frame(release_year=c(2007,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021,2022),num = c(0,0,0,0,0,0,0,0,0,0,0,0,0), incidents_per_year = c(0,0,0,0,0,0,0,0,0,0,0,0,0))
dreamworks_new <- rbind(dreamworks_new,dh2)


#INDEPENDENT COMPANIES MISSING DATA POINTS - ADDING THEM IN TO MAKE DATAFRAME UNIFORM
dy2 = data.frame(release_year=c(2011,2012,2013,2014,2015,2016,2020,2022),num = c(0,0,0,0,0,0,0,0), incidents_per_year = c(0,0,0,0,0,0,0,0))
mgm_new <- rbind(mgm_new, dy2)

production_company_matrix <- data.frame(year=sony_new$release_year, sony_incidents=sony_new$incidents_per_year, disney_incidents=disney_new$incidents_per_year, warner_incidents=warner_new$incidents_per_year, comcast_incidents=comcast_new$incidents_per_year, viacom_incidents=viacom_new$incidents_per_year, lionsgate_incidents=lionsgate_new$incidents_per_year, mgm_incidents=mgm_new$incidents_per_year,                           dreamworks_incidents=dreamworks_new$incidents_per_year, miramax_incidents=miramax_new$incidents_per_year)

production_company_matrix$year <- as.integer(production_company_matrix$year)

Year <- c(rep("2002",9),rep("2003",9),rep("2004",9),rep("2005",9),rep("2006",9),rep("2007",9),rep("2008",9),rep("2009",9),rep("2010",9),rep("2011",9),rep("2012",9),rep("2013",9),rep("2014",9),rep("2015",9),rep("2016",9),rep("2017",9),rep("2018",9),rep("2019",9),rep("2020",9),rep("2021",9),rep("2022",9))

Production_Company <- rep(c("Sony Tobacco Incidents" , "Disney Tobacco Incidents" , "Warner Tobacco Incidents", "Comcast Tobacco Incidents", "Viacom Tobacco Incidents", "Lionsgate Tobacco Incidents", "MGM Tobacco Incidents", "Dreamworks Tobacco Incidents", "Miramax Tobacco Incidents") , 21)

Tobacco_Incidents<- c()

#CREATING FOR LOOP TO ITERATRE OVER PRODUCTION COMPANY MATRIX AND ESSENTIALLY FLATTEN DATA
for (row in 1:nrow(production_company_matrix)){
  for (item in 2:(length(production_company_matrix[row,]))) {
    current_item <- production_company_matrix[row,][item]
    Tobacco_Incidents <- append(Tobacco_Incidents,current_item)
  }
}
Year <- unlist(Year)
Production_Company <- unlist(Production_Company)
Tobacco_Incidents <- unlist(Tobacco_Incidents)

lengths(Year)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1
lengths(Production_Company)
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1
lengths(Tobacco_Incidents)
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1 
##       sony_incidents     disney_incidents     warner_incidents 
##                    1                    1                    1 
##    comcast_incidents     viacom_incidents  lionsgate_incidents 
##                    1                    1                    1 
##        mgm_incidents dreamworks_incidents    miramax_incidents 
##                    1                    1                    1
data <- data.frame(Year,Production_Company,Tobacco_Incidents)

#CREATING STACKED BAR CHART
ggplot(data, aes(fill=Production_Company, y=Tobacco_Incidents, x=Year)) +
 geom_bar(position="stack", stat="identity") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +ggtitle("Production Company Tobacco Incidents Over Time")

Surprisingly, Lionsgate (an indie production company) dominated the tobacco incidents charts in the year 2016 with a total of 1026 tobacco incidents. All other indie production companies (Viacom, MGM, Dreamworks, and Miramax) failed to top the charts. Thankfully, the year 2020 and 2021 saw a drastic decrease in total tobacco incidents within film. While it’s still uncertain whether this trend will remain the norm, its an indicator of positive change.

plot(comcast_new$release_year,comcast_new$incidents_per_year, col ='blue', xlab = "Year", ylab = "Tobacco Incidents", main = "Tobacco Incidents Per Production Company Over Time", sub = "Among Only Non Indie Prod. Companies")
lines(comcast_new$release_year, comcast_new$incidents_per_year, col='blue',lwd=2)
lines(disney_new$release_year, disney_new$incidents_per_year, col='green',lwd=2)
lines(warner_new$release_year, warner_new$incidents_per_year, col='red',lwd=2)
lines(sony_new$release_year, sony_new$incidents_per_year, col='purple',lwd=2)
#lines(viacom_new$release_year, viacom_new$incidents_per_year, col='yellow',lwd=2)
#lines(mgm_new$release_year, mgm_new$incidents_per_year, col='wheat',lwd=2)
#lines(dreamworks_new$release_year, dreamworks_new$incidents_per_year, col='lightpink1',lwd=2)
#lines(miramax_new$release_year, miramax_new$incidents_per_year, col='darkolivegreen',lwd=2)
#lines(lionsgate_new$release_year, lionsgate_new$incidents_per_year, col='cyan',lwd=2) 
legend(x = "topright", title="Production Company", legend = c("Comcast", "Disney",  "Warner Bros", "Sony"), fill = c("blue", "green", "red", "purple"), cex =.45)

#CALCULATING AVERAGE TOBACCO INCIDENTS PER YEAR BY PRODUCTION COMPANY
mean(comcast_new$incidents_per_year)
## [1] 501.8571
mean(disney_new$incidents_per_year)
## [1] 419.3333
mean(warner_new$incidents_per_year)
## [1] 468.619
mean(sony_new$incidents_per_year)
## [1] 533.1905

Comcast dominates the early 2000s (particularly 2006 with a total of 1234 tobacco incidents). Sony dominates from 2008 to 2016 with highs in the year 2008 (998 tobacco incidents), 2011 (1044 tobacco incidents), 2013 (895 tobacco incidents), and 2016 (1073 tobacco incidents) Surprisingly, Disney dominated the chart in the year 2018 with 913 tobacco incidents. This disproves our initial hypothesis that Disney (the maker of children movies) would not top the charts of tobacco incidents. Warner Bros topped the charts in 2005 and 2019 with a total of 1037 and 1053 tobacco incidents, respectively. Comparing mean tobacco incidents from 2000 to 2022, Sony comes in first with an average 533.1905 tobacco incidents per year. Comcast comes in second with an average of 501.8571 tobacco incidents per year. Warner Bros comes in third with an average of 468.619 tobacco incidents per year. Disney comes in last with an average of 419.3333 tobacco incidents per year.

#GENERATING WORD CLOUD OF DIRECTORS 
director_i <- subset(final_clean, select = c("director"))

#SELECTING COLUMNS DIRECTOR AND TOBACCO INCIDENTS
director_incident3 <- subset(final_clean, select = c("director", "incidents_total"))

#GATHERING BY DIRECTOR AND INCIDENTS TOTAL
d1 <- gather(director_incident3, key = "director", value = "incidents_total")

#GROUPING BY DIRECTOR AND CALCULATING INCIDENTS PER DIRECTOR 
d2 <- d1 %>% group_by(director) %>% summarise(num = n(), incidents_per_director = sum(incidents_total)) 

#CREATNG NEW DATAFRAME CONSITING OF EACH DIRECTOR AND THEIR TOTAL TOBACCO INCIDENTS
matrix4 <- data.frame(word=d2$director, freq=d2$incidents_per_director)

#GENERATING WORD CLOUD FROM DIRECTOR AND THEIR TOTAL TOBACCO INCIDENTS
wordcloud(words = matrix4$word, freq = matrix4$freq, scale=c(2.5,.2), min.freq =100, max.words = 30, random.order = FALSE, rot.per=0.35, colors=brewer.pal(6,"Dark2"))

Ridley Scott and Quentin Tarantino dominate the director charts when it comes to total tobacco incidents.

#PREPPING DATA TO RUN LINEAR REGRESSION ANALYSIS BETWEEN YOUTH TOBACCO CONSUMPTION AND TOBACCO INCIDENTS IN PG/G MOVIES. AND TOBACCO IMPRESSIOONS IN THOSE MOVIES. 

#GETTING RID OF ROWS 1,2,3 TO BE UNIFORM WITH OTHER DATA
smokeless_tobacco_use_genderless_current1 <- smokeless_tobacco_use_genderless_current1  %>% filter(!row_number() %in% c(1,2,3))

#GETTING RID OF ROWS 1,2,3 TO BE UNIFORM WITH OTHER DATA
smokeless_tobacco_use_genderless_frequent1 <- smokeless_tobacco_use_genderless_frequent1  %>% filter(!row_number() %in% c(1,2,3))

#GETTING RID OF ROWS 1,2,3 TO BE UNIFORM WITH OTHER DATA
cigarette_use_genderless_current1 <- cigarette_use_genderless_current1 %>% filter(!row_number() %in% c(1,2,3))

#GETTING RID OF ROWS 1,2,3 TO BE UNIFORM WITH OTHER DATA
cigarette_use_genderless_frequent1 <- cigarette_use_genderless_frequent1  %>% filter(!row_number() %in% c(1,2,3))


#DELETING ROWS TO MAKE DATA UNIFORM TO MAKE INTO DATAFRAME
rating_impressions_dataframe <- rating_impressions_dataframe  %>% filter(!row_number() %in% c(18,17,19,20,21))

#DELETING ROWS TO MAKE DATA UNIFORM TO MAKE INTO DATAFRAME
rating_incidents_dataframe <- rating_incidents_dataframe %>% filter(!row_number() %in% c(18,17,19,20,21))

#CREATING FINAL DATAFRAME WHICH WE WILL PERFORM LINEAR REGRESSION ANALYSIS ON
final_linear <- data.frame(pg_g_incidents = rating_incidents_dataframe$pg_g_incidents, pg_13_incidents = rating_incidents_dataframe$pg_13_incidents, r_incidents = rating_incidents_dataframe$r_incidents, pg_g_impressions = rating_impressions_dataframe$pg_g_impressions, pg_13_impressions = rating_impressions_dataframe$pg_13_impressions, r_impressions = rating_impressions_dataframe$r_impressions, current_cigarette = cigarette_use_genderless_current1$current_smokers_per_year, frequent_cigarette = cigarette_use_genderless_frequent1$frequent_smokers_per_year, current_smokeless= smokeless_tobacco_use_genderless_current1$current_smokeless_per_year, frequent_smokeless = smokeless_tobacco_use_genderless_frequent1$frequent_smokeless_per_year)

#WE INITIALLY DIDN'T INCLUDE R-RATED MOVIE INCIDENTS BECAUSE WE THOUGHT KIDS WEREN'T SEEING THESE R RATED MOVIES MOVIES. OUR MORE RATIONAL AND REALISTIC THOUGHT PROCESS PREVAILED - WE DECIDED TO INCLUDE R-MOVIE DATA IN OUR REGRESSION ANALYSIS

We initially didn’t include R-rated movie incidents or impressions in our analysis because we thought kids weren’t seeing these movies. Our more rational and realistic though process prevailed - we decided to include R-movie data in our linear regression analysis.

#CURRENT YOUTH CIGARETTE SMOKERS LINEAR REGRESSION ANALYSIS

#RUNNING LINEAR REGRESSION ANALYSIS ON PG/G MOVIE TOBACCO INCIDENTS AND CURRENT YOUTH CIGARETTE SMOKERS
ggplot(final_linear, aes(x= pg_g_incidents, y= current_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG/G Movie Tobacco Incidents",
    y = "Current Cigarette Youth Smokers",
    subtitle = "Correlation between PG/G Movie Tobacco Incidents and Current Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#RUNNING LINEAR REGRESSION ANALYSIS ON PG-13 MOVIE TOBACCO INCIDENTS AND CURRENT YOUTH CIGARETTE SMOKERS
ggplot(final_linear, aes(x= pg_13_incidents, y= current_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG-13 Movie Tobacco Incidents",
    y = "Current Cigarette Youth Smokers",
    subtitle = "Correlation between PG-13 Movie Tobacco Incidents and Current Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS ON R MOVIE TOBACCO INCIDENTS AND CURRENT YOUTH CIGARETTE SMOKERS

ggplot(final_linear, aes(x= r_incidents, y= current_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "R Movie Tobacco Incidents",
    y = "Current Cigarette Youth Smokers",
    subtitle = "Correlation between R Movie Tobacco Incidents and Current Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS ON PG/G MOVIE TOBACCO IMPRESSIONS AND CURRENT YOUTH CIGARETTE SMOKERS

ggplot(final_linear, aes(x= pg_g_impressions, y= current_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG/G Movie Tobacco Impressions",
    y = "Current Cigarette Youth Smokers",
    subtitle = "Correlation between PG/G Movie Tobacco Impressions and Current Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#RUNNING LINEAR REGRESSION ANALYSIS ON PG-13 MOVIE TOBACCO IMPRESSIONS AND CURRENT YOUTH CIGARETTE SMOKERS

ggplot(final_linear, aes(x= pg_13_impressions, y= current_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG-13 Movie Tobacco Impressions",
    y = "Current Cigarette Youth Smokers",
    subtitle = "Correlation between PG-13 Movie Tobacco Impressions and Current Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS ON R MOVIE TOBACCO IMPRESSIONS AND CURRENT YOUTH CIGARETTE SMOKERS

ggplot(final_linear, aes(x= r_impressions, y= current_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "R Movie Tobacco Impressions",
    y = "Current Cigarette Youth Smokers",
    subtitle = "Correlation between R Movie Tobacco Impressions and Current Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#RESULTS
#TOBACCO INCIDENTS IN YOUTH RATED MOVIES IS INTERESTINGLY NOT THE PREDICTOR OF CURRENT YOUTH CIGARETTE SMOKERS, RATHER TOBACCO IMPRESSIONS IN MOVIES ARE.

So what are the results? Interestingly, tobacco incidents in youth rated movies is not the predictor of current youth cigarette smokers. Rather, tobacco impressions in youth rated movies are. This finding directly contradicts and disproves our initial hypothesis that, “there is a correlation between tobacco incidents within films and youth tobacco usage.”

Let’s assess whether these findings change when we run linear regression analysis on “frequent” youth cigarette smokers. For some reason our data set decided there was a resounding difference between current youth cigarette smokers and frequent youth cigarette smokers.

#FREQUENT CIGARETTE LINEAR ANALYSIS
#BECAUSE FOR SOME REASON OUR DATA SET USED DECIDED THERE WAS A RESOUNDING DIFFERENCE BETWEEN CURRENT YOUTH CIGARETTE SMOKERS AND FREQUENT YOUTH CIGARETTE SMOKERS

#LINEAR REGRESSION ANALYSIS ON PG/G MOVIE TOBACCO INCIDENTS AND FREQUENT YOUTH CIGARETTE SMOKERS
ggplot(final_linear, aes(x= pg_g_incidents, y= frequent_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG/G Movie Tobacco Incidents",
    y = "Frequent Cigarette Youth Smokers",
    subtitle = "Correlation between PG/G Movie Tobacco Incidents and Frequent Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS BETWEEN PG-13 MOVIE TOBACCO INCIDENTS AND CURRENT YOUTH CIGARETTE SMOKERS
ggplot(final_linear, aes(x= pg_13_incidents, y= frequent_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG-13 Movie Tobacco Incidents",
    y = "Frequent Cigarette Youth Smokers",
    subtitle = "Correlation between PG-13 Movie Tobacco Incidents and Frequent Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS BETWEEN R MOVIE TOBACCO INCIDENTS AND FREQUENT YOUTH CIGARETTE SMOKERS

ggplot(final_linear, aes(x= r_incidents, y= frequent_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "R Movie Tobacco Incidents",
    y = "Frequent Cigarette Youth Smokers",
    subtitle = "Correlation between R Movie Tobacco Incidents and Frequent Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS BETWEEN PG/G MOVIE TOBACCO IMPRESSIONS AND FREQUENT YOUTH CIGARETTE SMOKERS

ggplot(final_linear, aes(x= pg_g_impressions, y= frequent_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG/G Movie Tobacco Impressions",
    y = "Frequent Cigarette Youth Smokers",
    subtitle = "Correlation between PG/G Movie Tobacco Impressions and Frequent Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS BETWEEN PG-13 MOVIE TOBACCO IMPRESSIONS AND FREQUENT YOUTH CIGARETTE SMOKERS

ggplot(final_linear, aes(x= pg_13_impressions, y= frequent_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG-13 Movie Tobacco Impressions",
    y = "Frequent Cigarette Youth Smokers",
    subtitle = "Correlation between PG-13 Movie Tobacco Impressions and Frequent Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS ON R MOVIE TOBACCO IMPRESSIONS AND FREQUENT YOUTH CIGARETTE SMOKERS

ggplot(final_linear, aes(x= r_impressions, y= frequent_cigarette
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "R Movie Tobacco Impressions",
    y = "Frequent Cigarette Youth Smokers",
    subtitle = "Correlation between R Movie Tobacco Impressions and Frequent Youth Cigarette Smokers")
## `geom_smooth()` using formula = 'y ~ x'

#ONCE AGAIN, THIS TIME WITH FREQUENT YOUTH CIGARETTE SMOKING DATA, PG/G MOVIE TOBACCO IMPRESSIONS AND PG-13 MOVIE TOBACCO IMPRESSIONS HAD THE STRONGEST POSITIVE CORRELATION WITH FREQUENT YOUTH CIGARETTE SMOKERS. 

Once again, this time with frequent youth cigarette smoking data, PG/G Movie tobacco impressions and PG-13 Movie tobacco impressions had the strongest positive correlation with frequent youth cigarette smokers.

#CURRENT SMOKELESS YOUTH TOBACCO USERS

#LINEAR REGRESSION ANALYSIS ON PG/G MOVIE TOBACCO INCIDENTS AND CURRENT YOUTH SMOKELESS TOBACCO CONSUMPTION
ggplot(final_linear, aes(x= pg_g_incidents, y= current_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG/G Movie Tobacco Incidents",
    y = "Current Youth Smokeless Tobacco Users",
    subtitle = "Correlation between PG/G Movie Tobacco Incidents and Current Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS ON PG-13 MOVIE TOBACCO INCIDENTS AND CURRENT YOUTH SMOKELESS TOBACCO CONSUMPTION
ggplot(final_linear, aes(x= pg_13_incidents, y= current_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG-13 Movie Tobacco Incidents",
    y = "Current Youth Smokeless Tobacco Users",
    subtitle = "Correlation between PG-13 Movie Tobacco Incidents and Current Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS ON R MOVIE TOBACCO INCIDENTS AND CURRENT YOUTH SMOKELESS TOBACCO CONSUMPTION
ggplot(final_linear, aes(x= r_incidents, y= current_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "R Movie Tobacco Incidents",
    y = "Current Youth Smokeless Tobacco Users",
    subtitle = "Correlation between R Movie Tobacco Incidents and Current Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS ON PG/G MOVIE TOBACCO IMPRESSIONS AND CURRENT YOUTH SMOKELESS TOBACCO CONSUMPTION
ggplot(final_linear, aes(x= pg_g_impressions, y= current_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG/G Movie Tobacco Impressions",
    y = "Current Youth Smokeless Tobacco Users",
    subtitle = "Correlation between PG/G Movie Tobacco Impressions and Current Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS BETWEEN PG-13 MOVIE TOBACCO IMPRESSIONS AND CURRENT YOUTH SMOKELESS TOBACCO CONSUMPTION
ggplot(final_linear, aes(x= pg_13_impressions, y= current_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG-13 Movie Tobacco Impressions",
    y = "Current Youth Smokeless Tobacco Users",
    subtitle = "Correlation between PG-13 Movie Tobacco Impressions and Current Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

#LINEAR REGRESSION ANALYSIS ON R MOVIE TOBACCO IMPRESSIONS AND CURRENT YOUTH SMOKELESS TOBACCO USERS
ggplot(final_linear, aes(x= r_impressions, y= current_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "R Movie Tobacco Impressions",
    y = "Current Youth Smokeless Tobacco Users",
    subtitle = "Correlation between R Movie Tobacco Impressions and Current Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

Once again, although this time less strong, PG/G movie tobacco impressions and PG-13 movie tobacco impressions were positively correlated with current youth smokeless tobacco users. Tobacco incidents did not seem to be the highly correlated factor that led to increase in youth smoking rates. These conclusions are particularly important because this opens the possibility of future research to explore whether

Now let’s run linear regression analysis on frequent smokeless youth tobacco users:

#FREQUENT SMOKELESS YOUTH TOBACCO USERS
ggplot(final_linear, aes(x= pg_g_incidents, y= frequent_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG/G Movie Tobacco Incidents",
    y = "Frequent Youth Smokeless Tobacco Users",
    subtitle = "Correlation between PG/G Movie Tobacco Incidents and Frequent Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(final_linear, aes(x= pg_13_incidents, y= frequent_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG-13 Movie Tobacco Incidents",
    y = "Frequent Youth Smokeless Tobacco Users",
    subtitle = "Correlation between PG-13 Movie Tobacco Incidents and Frequent Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(final_linear, aes(x= r_incidents, y= frequent_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "R Movie Tobacco Incidents",
    y = "Frequent Youth Smokeless Tobacco Users",
    subtitle = "Correlation between R Movie Tobacco Incidents and Frequent Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(final_linear, aes(x= pg_g_impressions, y= frequent_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG/G Movie Tobacco Impressions",
    y = "Frequent Youth Smokeless Tobacco Users",
    subtitle = "Correlation between PG/G Movie Tobacco Impressions and Frequent Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(final_linear, aes(x= pg_13_impressions, y= frequent_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "PG-13 Movie Tobacco Impressions",
    y = "Frequent Youth Smokeless Tobacco Users",
    subtitle = "Correlation between PG-13 Movie Tobacco Impressions and Frequent Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(final_linear, aes(x= r_impressions, y= frequent_smokeless
)) + geom_point() +
  geom_smooth(method=lm, color="red", fill="#69b3a2", se=TRUE) +
  theme_ipsum() +
  labs(
    x = "R Movie Tobacco Impressions",
    y = "Frequent Youth Smokeless Tobacco Users",
    subtitle = "Correlation between R Movie Tobacco Impressions and Frequent Youth Smokeless Tobacco Users")
## `geom_smooth()` using formula = 'y ~ x'

Ultimately, the linear regression analysis shown above revealed there was no significant correlation between 1. PG/G movie tobacco incidents and “frequent” youth smokeless tobacco users, 2. PG-13 movie tobacco incidents and “frequent” youth smokeless tobacco users, 3. R movie tobacco incidents and “frequent” youth smokeless tobacco users, 4. PG/G movie tobacco impressions and “frequent” youth smokeless tobacco users, 5. PG-13 movie tobacco impressions and “frequent” youth smokeless tobacco users, 6. R movie tobacco impressions and “frequent” youth smokeless tobacco users.

#LET'S RUN A MORE IN DEPTH ANALYSIS ON PG-13 MOVIE TOBACCO IMPRESSIONS AND CURRENT YOUTH CIGARETTE SMOKERS.

#CREATING NEW MODEL OF PG-13 IMPRESSIONS AND CURRENT YOUTH CIGARETTE DATA 
new_model <- lm(pg_13_impressions ~ current_cigarette, data = final_linear)

#PLOTTING INTIAL LINEAR REGRESSION
par(cex.main =.7)
plot(final_linear$pg_13_impressions ~ final_linear$current_cigarette, main = "Linear Regression on PG_13 Tobacco Impressions \nand \nAmount of Current Youth Cigarette Smokers", xlab = "% of Current Youth Cigarette Smokers", ylab = "PG-13 Tobacco Impressions") + abline(new_model, col = "red")

## integer(0)
#PLOTTING RESIDUALS 
par(cex.main =.7)
plot(residuals(new_model) ~ fitted(new_model), main = "Residual Plot of PG_13 Tobacco Impressions \nand \nAmount of Current Youth Cigarette Smokers") 
abline(a = 0, b = 0, col = "red")

#CALLING SUMMARY TO GET R SQUARED AND P VALUES
summary(new_model)
## 
## Call:
## lm(formula = pg_13_impressions ~ current_cigarette, data = final_linear)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6.479e+09 -2.856e+09 -4.103e+08  3.217e+09  5.374e+09 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       1.344e+09  3.048e+09   0.441  0.66597   
## current_cigarette 8.313e+08  2.659e+08   3.126  0.00744 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.634e+09 on 14 degrees of freedom
## Multiple R-squared:  0.411,  Adjusted R-squared:  0.3689 
## F-statistic:  9.77 on 1 and 14 DF,  p-value: 0.007443

Our R squared Value of .411 and p-value of .007433 illustrates a positive correlation between pg_13 film tobacco impressions and current youth cigarette smokers. As pg_13 film tobacco impressions increases, so will the number of current youth cigarette smokers. Our residual plot clusters towards the middle of the plot, indicating that the predictions were relatively accurate.

#NOW LET'S RUN A LINEAR REGRESSION ANALYSIS ON PG/G MOVIE TOBACCO IMPRESSIONS AND CURRENT YOUTH CIGARETTE SMOKERS

#CREATING NEW MODEL OF PG AND G MOVIE TOBACCO IMPRESSIONS AND CURRENT CIGARETTE DATA
newest_model <- lm(pg_g_impressions ~ current_cigarette, data = final_linear)

#PLOTTING LINEAR REGRESSION 
par(cex.main =.7)
linear <- plot(final_linear$pg_g_impressions ~ final_linear$current_cigarette, main = "Linear Regression on PG/G Tobacco Impressions\n and \nAmount of Current Youth Cigarette Smokers", xlab = "% of Current Youth Cigarette Smokers", ylab = "PG/G Tobacco Impressions") + abline(newest_model, col = "red")

#PLOTTING RESIDUALS
par(cex.main =.7)
residual <- plot(residuals(newest_model) ~ fitted(newest_model), main = "Residual Plot of PG/G Tobacco Impressions \n and \nAmount of Current Youth Cigarette Smokers", xlab = "Fitted Model", ylab = "Residuals of Newest Model") 
abline(a = 0, b = 0, col = "red")

#CALLING SUMMARY TO GET R SQUARED AND P VALUES
summary(newest_model)
## 
## Call:
## lm(formula = pg_g_impressions ~ current_cigarette, data = final_linear)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.055e+09 -4.115e+08 -1.219e+08  1.557e+08  1.598e+09 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)  
## (Intercept)       -768180641  576572134  -1.332   0.2040  
## current_cigarette  138480864   50314137   2.752   0.0156 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 687500000 on 14 degrees of freedom
## Multiple R-squared:  0.3511, Adjusted R-squared:  0.3048 
## F-statistic: 7.575 on 1 and 14 DF,  p-value: 0.01557

Our R squared Value of .3511 and p-value of .01557 illustrates a positive correlation between PG and G film tobacco impressions and current youth cigarette smokers. As PG and G film tobacco impressions increases, so will the number of current youth cigarette smokers. Our residual plot clusters towards the middle of the plot, indicating that the predictions were relatively accurate.

#GENERATING WORD CLOUD OF DIRECTORS TOBACCO IMPRESSIONS IN PG/G/PG-13 MOVIES


#FILTERING BY PG / G / PG-13 FILMS
pg_g_pg_13_films <- filter(final_clean, rating == "PG" | rating == "G" | rating == "PG-13")

#SELECTING JUST COLUMN DIRECTOR AND IMPRESSIONS 
director_incident4 <- subset(pg_g_pg_13_films, select = c("director", "impressions"))

#GATHERING BY DIRECTOR AND IMPRESSIONS
d5 <- gather(director_incident4, key = "director", value = "impressions")

#ADDING TOTAL IMPRESSIONS PER DIRECTOR
d7 <- d5 %>% group_by(director) %>% summarise(num = n(), impressions_per_director = sum(impressions)) 

#ORDERING BY DECREASING
d7_final <- d7[order(d7$impressions_per_director,decreasing =TRUE),]

#CALLING FINAL WORDCLOUD
wordcloud(words = d7_final$director, freq = d7_final$impressions_per_director, scale=c(3,.2), min.freq =100, max.words = 30, random.order = FALSE, rot.per=0.35,colors=brewer.pal(8,"Dark2"))

Given our new conclusion that tobacco Impressions in PG/G & PG-13 movies directly positively correlates to current cigarette and smokeless tobacco youth users, let’s re-run our initial director word-cloud; however, this time we’ll filter by Tobacco impression instead of incidents and only include PG/G & PG-13 movies. We can exonerate Ridley Scott and Quentin Tarantino from the list of directors that likely produced the greatest amount of youth tobacco users due to one reason: While Tarantino and Scott films both top the charts when it comes to tobacco incidents, their movies are more often rated R - ultimately, limiting the impression tobacco has on youth.

Ultimately, Peter Jackson now becomes the director whose movies feature the most tobacco impressions. This is most likely caused by his production of the Lord of the Rings series. While we want to be sure not to conflate causality with correlation, we think further research is needed to investigate whether the Lord of the Rings series potentially contributed to an increase in youth tobacco users. This further solidifies our recommendation that if PG-13 Movies with high amounts of tobacco incidents were to be rated as R, they would ultimately have less of a tobacco impression on youth, hypothetically resulting in fewer youth tobacco users.

Now let’s re-run our Production Company Tobacco incident analysis; however, this time we’ll use tobacco impressions instead. This will give us greater insight into which tobacco companies are influencing youth tobacco usage the most.

#SORTING SONY IMPRESSION DATA
sony_impression <- sony_data %>% group_by(release_year) %>% summarise(num = n(), impression_per_year = sum(impressions)) 

#SORTING DISNEY IMPRESSION DATA
disney_impression <- disney_data %>% group_by(release_year) %>% summarise(num = n(), impression_per_year = sum(impressions)) 

#SORTING WARNER BROS IMPRESSION DATA
warner_impression <- warner_bros_data  %>% group_by(release_year) %>% summarise(num = n(), impression_per_year = sum(impressions)) 

#SORTING COMCAST IMPRESSION DATA
comcast_impression <- comcast_data  %>% group_by(release_year) %>% summarise(num = n(), impression_per_year = sum(impressions)) 

#PLOTTING PRODUCTION COMPANY TOBACCO IMPRESSION DATA
plot(comcast_impression$release_year,comcast_impression$impression_per_year, col ='blue', xlab = "Year", ylab = "Tobacco Impression", main = "Tobacco Impression Per Production Company Over Time", sub = "Among Only Non Indie Prod. Companies")
lines(comcast_impression$release_year, comcast_impression$impression_per_year, col='blue',lwd=2)
lines(disney_impression$release_year, disney_impression$impression_per_year, col='green',lwd=2)
lines(warner_impression$release_year, warner_impression$impression_per_year, col='red',lwd=2)
lines(sony_impression$release_year, sony_impression$impression_per_year, col='purple',lwd=2)
#lines(viacom_new$release_year, viacom_new$incidents_per_year, col='yellow',lwd=2)
#lines(mgm_new$release_year, mgm_new$incidents_per_year, col='wheat',lwd=2)
#lines(dreamworks_new$release_year, dreamworks_new$incidents_per_year, col='lightpink1',lwd=2)
#lines(miramax_new$release_year, miramax_new$incidents_per_year, col='darkolivegreen',lwd=2)
#lines(lionsgate_new$release_year, lionsgate_new$incidents_per_year, col='cyan',lwd=2) 
legend(x = "topright", title="Production Company", legend = c("Comcast", "Disney",  "Warner Bros", "Sony"), fill = c("blue", "green", "red", "purple"), cex =.45) + coord_cartesian(clip = "off")

## NULL
# CALCULATING MEAN TOBACCO IMPRESSIONS BY PRODUCTION COMPANY
mean(warner_impression$impression_per_year)
## [1] 3830086058
mean(sony_impression$impression_per_year)
## [1] 3199819464
mean(disney_impression$impression_per_year)
## [1] 2628315533
mean(comcast_impression$impression_per_year)
## [1] 2776106993

Charting tobacco impressions by production company (instead of tobacco incidents) tells a drastically different story than soley looking at tobacco incidents. Warner Bros tobacco impressions spiked in 2012 with a total of 9,075,673,583 tobacco impressions (something previously invisible when just looking at just tobacco incidents.) Taking the average of all tobacco impressions within film from 2000 to 2022, Warner Bros leads the way with 3,830,086,058. For comparison Sony has an average tobacco impression of 3,199,819,464. Disney - 2,628,315,533 mean tobacco impressions. Comcast - 2,776,106,993 mean tobacco impressions.

Area of Further Research:

Despite the analysis listed above, more research is needed to definitively determine whether or not the relationship between tobacco impressions in film and youth tobacco rates are merely correlated or whether the former causes the latter. While we were limited by static data that we could not manipulate in order to establish causality, one way to remedy this would be to establish a controlled study. A control study in which two different groups (comparable in almost every way) receive different treatments (namely different levels of exposure to tobacco consumption in films) and the different results are recorded, is often regarded as the best way to establish causality between two variables. Due to ethical reasons, the probability of this study being carried out is nearly non existent due to the fact that it would be inappropriate and unethical to have one of the groups undergo an activity that potentially leads to harm (smoking). Regardless, future research is essential to continue studying how smoking(or any other activity) in film affects our own actions. Future research could also focus on how smoking in films affects different racial groups, genders, socioeconomic backgrounds, etc.

Conclusion:

Ultimately, our study revealed the importance of distinguishing between tobacco incidents and tobacco impressions - while both are invaluable - they tell drastically different stories. We proved a positive correlation between current youth cigarette users / current youth smokeless tobacco users and tobacco impressions within PG/G & PG-13 films.

Addressing our previous hypotheses we were able to prove and disprove an equal amount.

  1. The presence of smoking cigarettes in cinema has surprisingly remained consistent through 2020. Even more surprisingly, the presence of alternative smoking means (e-cigarettes, etc) has remained relatively low. 

  2. The number of overall tobacco incidents in cinema has remained relatively high throughout the the first two decades of the 2000’s.

  3. There is a correlation between tobacco impressions within films and youth tobacco usage. Surprisingly, tobacco incidents had no correlation with youth tobacco usage.

  4. Our initial hypothesis that production companies like Disney produce the least amount of tobacco incidents over time is true from an average standpoint, Disney still topped the tobacco incidents charts in the year 2018 with 918 tobacco incidents. Looking at mean tobacco incidents per year vs mean tobacco impressions per year told drastically different stories.

Ranked Mean Tobacco Incidents Per Year by Production Company:

  1. Sony

  2. Comcast

  3. Warner Bros

  4. Disney

Ranked Mean Tobacco Impressions Per Year by Production Company:

  1. Warner Bros

  2. Sony

  3. Comcast

  4. Disney

  5. While it’s true that Quentin Tarantino dominates the charts when it comes to tobacco incidents, this is not correlated with increased youth tobacco rates. Instead, after filtering films that feature the most amount of tobacco impressions in youth rated films - Tarantino no longer tops the list. Fascinatingly, charting director tobacco incidents vs tobacco impressions in youth movies tells drastically different stories.

    1. Directors with greatest tobacco incidents

      1. Ridley Scott

      2. Quentin Tarantino

    2. Directors with greatest tobacco impressions in youth rated (PG/G/PG-13) movies

      1. Peter Jackson

      2. Steven Spielberg

      3. Bryan Singer

We effectively exonerated Quentin Tarantino and Ridley Scott from causing increased youth tobacco usage (not to say that Peter Jackson caused any increase either because we did not prove causality but merely correlation.)

Final Recommendation

We would finally recommend that in order to limit tobacco impressions (something which is correlated with youth smoking cigarettes and smokeless means of tobacco) PG-13 movies that feature heavy amounts of tobacco incidents should be rated as R - effectively limiting the number of tobacco impressions.